This notebook shows how to run Debugger's profiling functionality. We use a tutorial from PyTorch website that fine-tunes a pre-trained Mask R-CNN model for instance segmentation on a new dataset: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
First, let's download the dataset:
! wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
! unzip PennFudanPed.zip
! mv PennFudanPed entry_point/PennFudanPed
Next we define the profiler configuration. It instructs Debugger to sample system metrics every 100 milliseconds and to collect detailed framework metrics (data-loading times, Python profiling, GPU operators, and so on) for two steps starting at step 10.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile, DetailedProfilingConfig, DataloaderProfilingConfig, PythonProfilingConfig, PythonProfiler
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=100,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=10,
            num_steps=2
        )
    )
)
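The imports above also bring in DataloaderProfilingConfig, PythonProfilingConfig, and PythonProfiler, which this minimal configuration does not use. As an illustrative sketch, a fuller FrameworkProfile could enable them as well; the step ranges below are made up for the example:

```python
from sagemaker.debugger import (
    ProfilerConfig, FrameworkProfile, DetailedProfilingConfig,
    DataloaderProfilingConfig, PythonProfilingConfig, PythonProfiler,
)

# Illustrative configuration that also enables dataloader and Python
# profiling; the step ranges here are arbitrary examples.
profiler_config_full = ProfilerConfig(
    system_monitor_interval_millis=100,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=10, num_steps=2
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=12, num_steps=2
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=14, num_steps=2,
            python_profiler=PythonProfiler.CPROFILE
        ),
    )
)
```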
Next we start the SageMaker training job. Debugger will automatically apply the necessary configuration to the training script and model to obtain the profiling data. The instance segmentation model training is defined in train.py. Let's take a look at the training script:
! cat entry_point/train.py
import sagemaker
from sagemaker.pytorch import PyTorch
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    image_uri=image_uri,
    instance_type='ml.p3.2xlarge',
    source_dir='entry_point',
    entry_point='train.py',
    profiler_config=profiler_config
)
estimator.fit(wait=False)
Once the job is running we can use the smdebug library to access and query the profiling data while training is still in progress. We can then plot timeline charts and heatmaps, or download the profiler report from Amazon S3.
import smdebug
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
jobname = estimator.latest_training_job.job_name
tj = TrainingJob(jobname, 'us-west-2')
To check whether the system and framework metrics are available at the S3 URI:
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
To create system and framework reader objects once the metric data becomes available:
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
Histogram visualizations of system metrics:
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
metrics_histogram = MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(
    starttime=0,
    endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),
    select_dimensions=["CPU", "GPU", "I/O"],
    select_events=["total"]
)
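MetricsHistogram buckets the utilization samples of each metric into a distribution. The bucketing itself can be sketched in plain Python (the sample values below are made up):

```python
# Bucket utilization samples (0-100%) into 10%-wide histogram bins --
# the kind of per-metric distribution MetricsHistogram plots.
def utilization_histogram(samples, bin_width=10):
    bins = {}
    for value in samples:
        # Clamp 100% into the last bin instead of creating an 11th bin.
        bucket = min(int(value // bin_width), (100 // bin_width) - 1)
        bins[bucket] = bins.get(bucket, 0) + 1
    return bins

cpu_samples = [5, 12, 18, 95, 97, 99, 42]  # made-up CPU utilization values
print(utilization_histogram(cpu_samples))  # {0: 1, 1: 2, 9: 3, 4: 1}
```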
The step timeline chart shows the step duration (the time for the forward and backward pass) while training is in progress. The x-axis shows the training job duration and the y-axis shows the step duration.
from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)
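The quantity being plotted, step duration, is just the gap between each step's start and end timestamps. A rough sketch with made-up timestamps (in milliseconds):

```python
# Compute step durations from (start, end) timestamps in milliseconds --
# the quantity the step timeline chart plots over the job duration.
def step_durations(step_intervals):
    return [end - start for start, end in step_intervals]

# Made-up step boundaries: steps taking 500 ms, 700 ms, and 400 ms.
intervals = [(0, 500), (600, 1300), (1400, 1800)]
print(step_durations(intervals))  # [500, 700, 400]
```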
Timeseries charts show the utilization across the training job duration:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"],  # optional
    select_events=["total"]  # optional
)
The heatmap visualization makes it easier to identify bottlenecks, where GPU utilization is low while CPU utilization is high.
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
view_heatmap = Heatmap(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"],
    plot_height=450
)
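The pattern the heatmap surfaces, high CPU utilization paired with low GPU utilization, can be sketched as a simple scan over paired samples (the values and thresholds below are made up):

```python
# Flag time indices where CPU utilization is high while GPU utilization
# is low -- the CPU-bottleneck pattern the heatmap makes visible.
def cpu_bottleneck_steps(cpu_util, gpu_util, cpu_high=80, gpu_low=20):
    return [
        i for i, (cpu, gpu) in enumerate(zip(cpu_util, gpu_util))
        if cpu >= cpu_high and gpu <= gpu_low
    ]

cpu = [95, 90, 30, 20, 85]  # made-up utilization samples per interval
gpu = [10, 15, 80, 90, 60]
print(cpu_bottleneck_steps(cpu, gpu))  # [0, 1]
```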
We can now generate the merged timeline, which gives a detailed, combined view of system and framework metrics. The resulting JSON file can be loaded into the Chrome trace viewer (chrome://tracing).
import time
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
combined_timeline.merge_timeline(0, time.time())
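The merged timeline is written in the Chrome trace-event JSON format. A minimal, hand-built example of that format (not actual smdebug output) looks like this:

```python
import json

# A minimal Chrome trace-event document: one complete ("X") event with
# a start timestamp and a duration, both in microseconds. Files in this
# shape load in the Chrome trace viewer, like the merged timeline above.
trace = {
    "traceEvents": [
        {
            "name": "Step:1",  # event label shown in the viewer
            "ph": "X",         # "X" = complete event (has a duration)
            "ts": 1000,        # start timestamp in microseconds
            "dur": 500,        # duration in microseconds
            "pid": 0,          # process lane
            "tid": 0,          # thread lane
        }
    ]
}

with open("minimal_trace.json", "w") as f:
    json.dump(trace, f)
```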
We can also look at the profiling report, which shows that the training suffered from CPU bottlenecks 37% of the time and from I/O bottlenecks 21% of the time.
Next we run a multi-GPU training job, for which we use PyTorch's DataParallel.
import sagemaker
from sagemaker.pytorch import PyTorch
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04"
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    image_uri=image_uri,
    instance_type='ml.p3.8xlarge',
    source_dir='entry_point',
    entry_point='train.py',
    profiler_config=profiler_config
)
estimator.fit(wait=False)
jobname = estimator.latest_training_job.job_name
tj = TrainingJob(jobname, 'us-west-2')
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
Now let's compare the heatmap to the previous training run:
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
view_heatmap = Heatmap(
    system_metrics_reader,
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"],
    plot_height=750
)